ggplot2
“The simple graph has brought more information to the data analyst’s mind than any other device.” (John Tukey)
Visual representations help us to understand data quickly and share our results in an effective way. This lesson will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics. This system of building graphs allows creating any kind of plot by specifying the essential building blocks which comprise it.
For example, we can break down this plot into its fundamental building blocks:
ggplot2 follows the opposite procedure: it builds a graph by adding layers. At least three layers are necessary to build a plot:
Data: the actual variables to be plotted.
Aesthetics: the scales onto which we will map our data.
Geometries: shapes used to represent our data.
Once the foundation of our plot is established, we can define more advanced features:
Facets: rows and columns of sub-plots.
Statistics: statistical models and summaries.
Coordinates: the plotting space we are using.
One limitation is that ggplot2 is designed to work exclusively with data tables in tidy format (where rows are observations and columns are variables). However, most data sets can be converted easily into this format. Well-structured data will save you lots of time when making figures with ggplot2.
The ggplot2 package is included in a popular collection of packages called tidyverse. So, first, install and load tidyverse. You only need to install a package once, but you need to reload it every time you start a new session.
install.packages("tidyverse",repos = "http://cran.us.r-project.org") # install package
library(tidyverse) # load library
iris data frame
We will use the data set iris to show the versatility of ggplot2. First, have a look at the structure of the data by applying the function glimpse() to the data set iris with the pipe operator %>%.
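The chunk that produces the output shown next is missing from this page. A minimal sketch of what it presumably looks like, assuming dplyr (part of the tidyverse) supplies glimpse() and the pipe:

```r
library(dplyr)   # part of the tidyverse; provides glimpse() and %>%

# Pipe the built-in iris data set into glimpse() to see its structure
iris %>% glimpse()
```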
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...
The iris data set contains the measurements in centimeters of the variables sepal length and width and petal length and width, for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica. Note that the first 4 variables are numeric, while the variable Species is a factor of 3 levels.
We can also pipe the iris data set into the function summary() to obtain the main statistics of each variable.
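The corresponding chunk is also missing here. A sketch of the call, assuming the pipe comes from dplyr (the name iris_summary is only used for this illustration):

```r
library(dplyr)   # provides the %>% pipe

# Summary statistics for every column of iris
iris_summary <- iris %>% summary()
iris_summary
```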
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
ggplot2: step by step
ggplot graphics are built step by step by adding new elements (layers) using the + sign. To build a ggplot, we will use the following basic template that can be applied for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
The first layer of our ggplot will be the data. We have to declare which data set we are going to use in the graph.
Use the ggplot() function to tell R that you want to create a plot, and specify which data set to plot with the data argument.
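A minimal sketch of this first step, with only the data layer declared (the object name p is just for illustration):

```r
library(ggplot2)

# Data layer only: no aesthetics and no geoms yet
p <- ggplot(data = iris)
p
```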
A grey box will be shown because we have only declared which data set will be used. Details such as the graph type and the mapping are still missing, so R does not know what kind of graph to draw or which variables to display.
The next layer that we need to establish is the axes. We are interested in the relationship between sepal length and petal length, so our axes are Sepal.Length and Petal.Length.
In order to specify the axes, we need to use the aes() function. aes is short for “aesthetic”, and it is where we tell ggplot which columns we want to use for different parts of the plot. Since we are looking at the relationship between sepal length and petal length, Sepal.Length will go on the x-axis and Petal.Length on the y-axis.
ggplot(data = iris, # specify the data set
mapping = aes(x = Sepal.Length, # specify the aesthetics mapping
y = Petal.Length))
With the addition of the aes() function, the graph now knows which columns to assign to the axes. But notice that there’s still nothing on the plot! We still need to tell ggplot() what kind of shapes to use to visualize the relationship between Sepal.Length and Petal.Length.
When we think of visualizations, we normally think about the type of graph, since it’s really the shape we see that tells us most of the information. While the ggplot2 package gives us a lot of flexibility in choosing a shape to draw the data, it’s worth taking some time to consider which one is best for our question.
We want to see whether there is a relationship between Sepal.Length and Petal.Length. For this, a scatter plot is a good choice.
To create a scatter graph with ggplot(), we use the geom_point() function. A geom is the name for the specific shape that we want to use to visualize the data. All of the functions that are used to draw these shapes have geom in front of them. geom_line() creates a line graph, geom_point() creates a scatter plot, geom_boxplot() creates a box and whisker plot and so on.
To add a geom to the plot, use the + operator. This is how more layers are added to the plot. Important: always place the + operator at the end of a line; placing it at the beginning of the next line will give an error.
ggplot(data = iris, # specify the data set
mapping = aes(x = Sepal.Length, # specify the aesthetics mapping
y = Petal.Length)) +
geom_point() # specify plot type
It seems there is an association between the length of the sepals and the length of the petals (the longer the sepals, the longer the petals).
We could stop the plot here if we were just looking at the data quickly, but this is rarely the case. More common is that you’ll be creating a visualization for a report or for others on your team. In this case, the plot is not complete: if we were to give it to a teammate with no context, they would not understand the plot. Ideally, all of your plots should be able to explain themselves through the annotations and titles.
Currently the graph uses the column names as the labels for both axes. We want to change the axis labels to specify the unit of measurement. To change the axis labels of a plot, we can use the xlab() and ylab() functions and add them as layers onto the plot. The ggtitle() function can be added as another layer to set the title. Note that the new axis labels and title must be written between quotation marks; otherwise the code will give an error.
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length)) +
geom_point() + # geom layer
xlab("Sepal length (cm)") + # x-axis label
ylab("Petal length (cm)") + # y-axis label
ggtitle("Relationship between Sepal Length and Petal Length for iris species")
This is our final polished graph. As we have seen, it is composed of four layers: the data layer, the aesthetics mapping layer, the geom layer and the annotations layer. This process may seem too verbose for a simple graph like this one. Indeed, we could have created a similar graph with the plot() function included in base R. The plot() function selects a type of graph according to the type of your variables, e.g. a boxplot for continuous vs. factor, a scatter plot for continuous vs. continuous. In this case, we only need to call plot() and specify the variable we would like on each axis. Note that we have not specified the data frame we want to plot; therefore, we must refer to each variable as dataframe$variablecolumn.
plot(x=iris$Sepal.Length, # Specify the x-axis variable
y=iris$Petal.Length) # Specify the y-axis variable
Then why use the verbose ggplot instead of the simple plot()? The next sections will let you judge for yourself.
This data set contains a factor, Species, with three levels: setosa, versicolor and virginica. We can easily check whether the different species show different associations between sepal and petal length. A quick way of doing this is to color the dots according to the level of the Species factor. For this, we use color = Species.
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Species)) + # color by the Species factor
geom_point() + # geom layer
ggtitle("A. Scatter plot, colored according to species") # title layer
We can add a new layer with a trend line. To do this, we specify a linear fit (“lm”) in the geom_smooth() function with the argument method, i.e. geom_smooth(method = "lm").
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Species)) + # color by the Species factor
geom_point() + # geom layer
geom_smooth(method = "lm") + # add linear trend
ggtitle("B. Scatter plot with linear trends, colored by species") # title layer
As you can see, the grammar of ggplot2 allows for more complex visualizations with little extra effort. The richness of this package lies in the great variety of geom functions that we can incorporate.
aes()
Besides defining the axes, the aes() function is used to tell ggplot2 how to draw the different lines, shapes, colors and sizes. By adding aes() to the ggplot() call, we share that information with all the layers. If we want the information to apply to only one of the layers, we must use aes() in the corresponding layer, for example within the geom_point() call. This may seem confusing. Let’s explore the following lines of code to understand why they generate a plot (C) that looks identical to plot A (two chunks above).
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length)) + # axes layer
geom_point(mapping = aes(color = Species)) + # Note the difference: aes() is now within geom_point()
ggtitle("C. Scatter plot, aes() within geom_point()")
At first sight it is impossible to differentiate plot A from plot C. However, we can see the difference by trying to repeat plot B (trend lines).
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length)) + # axes layer
geom_point(mapping = aes(color = Species)) + # dots color by Species
geom_smooth(method = "lm") + # linear trend for complete df
ggtitle("D. Scatter plot colored by Species, linear trend of whole data set")
In this case, geom_smooth() does not receive the instruction to group by Species; therefore, all the data are used to build a single linear fit. This behavior gives us great versatility in plots. However, it also leads to some common mistakes. For example, let’s try changing all the points in the first scatter plot from black to magenta.
First, we draw our reference plot with dots in black:
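The chunk for the reference plot is not shown on this page; a minimal sketch of it, reusing the scatter plot built earlier (the name p_ref is only for illustration):

```r
library(ggplot2)

# Reference scatter plot: points use the default color (black)
p_ref <- ggplot(data = iris,
                mapping = aes(x = Sepal.Length,
                              y = Petal.Length)) +
  geom_point()
p_ref
```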
Our first attempt is to change the color of the dots by setting the argument color within the general aesthetics in the ggplot() function. As you can see, this is the wrong approach. Inside aes(), "magenta" is not read as a color name: it is treated as a constant variable with the single value "magenta", which is mapped to the first color of the default palette in every layer. Note also that aes() mappings add legends to the plot.
ggplot(iris,
mapping = aes(x = Sepal.Length,
y = Petal.Length,
color = "magenta")) + # maps the constant "magenta" to color; does not paint the dots magenta
geom_point()
Let’s try to specify the color argument within the geom_point() function:
ggplot(iris,
mapping = aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(color = "magenta") # color the dots in magenta
This is what we wanted!
Play with aes() until you get familiar with it. The best way of learning R is by trial and error.
In the next plot, note that the size argument is outside the aesthetics mapping. What would have happened if we had put it inside?
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length)) + # axes layer
geom_point(mapping = aes(color = Species, # color dots by Species
shape = Species), # shape of dots by Species
size = 4) # size of dots
geom
The scatter plot is a good option for plotting two continuous variables. However, other types of geometries are more suitable when we are dealing with factors (discrete variables with levels), such as Species. The different geom functions allow us to obtain different graphics from the same data. Let’s analyze Sepal.Length (continuous variable) by species (factor).
We could plot these two variables with geom_point(), although this is not a good idea: many points overlap, so we lose information on the distribution of the observations.
ggplot(iris, # data layer
mapping = aes(x = Species, # x-axis
y = Sepal.Length)) + # y-axis
geom_point(mapping =
aes(color = Species)) # geom layer, points, colored by Species
A way of solving this is to use geom_jitter(). This function draws the points with random noise added along the horizontal axis, which avoids overlapping and shows which values of sepal length gather more observations. With the argument width we specify the horizontal spread of the points. (Unless height = 0 is set, a small amount of vertical noise is added as well.)
ggplot(iris, # data layer
mapping = aes(x = Species, # x-axis
y = Sepal.Length)) + # y-axis
geom_jitter(mapping = # geom layer, non-overlapped points
aes(color = Species), # colored by Species
width = 0.1) # jitter width
To summarize the information contained in the data, it is common to use boxplots. This type of diagram shows the distribution of the data: the median, the interquartile range (IQR), the minimum and maximum, and the outliers.
ggplot(iris, # data layer
mapping = aes(x = Species, # x-axis
y = Sepal.Length)) + # y-axis
geom_boxplot(mapping = # geom layer, boxplot
aes(fill = Species)) # filled by Species
The height of the boxes shows that the spread of the sepal length values is smaller in setosa than in versicolor and virginica. The median (the middle value of the data set) of virginica is higher than that of versicolor, which in turn is higher than that of setosa. There is one outlier in virginica, represented by a dot: a value that lies more than 1.5 times the IQR away from the edge of the box.
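The 1.5 × IQR rule can be checked by hand. The following is only a sketch in base R: it assumes the default quantile() definition is close enough to what the boxplot uses, and the variable names are chosen for illustration.

```r
# Sepal length of the virginica flowers only
x <- iris$Sepal.Length[iris$Species == "virginica"]

q1  <- quantile(x, 0.25)   # lower edge of the box
q3  <- quantile(x, 0.75)   # upper edge of the box
iqr <- q3 - q1             # interquartile range (height of the box)

# Fences: values beyond them are drawn individually as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Any value returned in outliers falls outside the whiskers and would be plotted as a separate dot.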
We can also use a violin plot to represent the distribution of the data.
ggplot(iris, # data layer
mapping = aes(x = Species, # x-axis
y = Sepal.Length)) + # y-axis
geom_violin(mapping = # geom layer, violin plot
aes(fill = Species)) # filled by Species
The violin plot shows the full distribution of the data, that is, the probability density of the data at different values. For example, a sepal length of 5 cm is highly frequent in the setosa species.
We could combine both plots into one by adding two geom layers. We specify some aesthetic attributes within the geom layers to increase the contrast between them. Within the geom_violin() layer we change the filling color with the attribute fill, and the opacity with the alpha attribute. Alpha takes values from 0 to 1, 0 being totally transparent and 1 being totally opaque. Within the geom_boxplot() layer we change the color of the edge line with the color attribute, the filling color with fill, the width of the outside line with lwd and the width of the boxplot with width.
ggplot(iris, # data layer
mapping = aes(x = Species, # x-axis
y = Sepal.Length)) + # y-axis
geom_violin(fill='orange', # geom layer, violin filled in orange
alpha=0.5) + # specify opacity with alpha
geom_boxplot(color="white", # geom layer, boxplot, edge line color
fill="black", # filling color
lwd=0.8, # edge line width
width=0.2 ) # boxplot width
Does the order of the layers matter? It does! Layers are drawn in the order in which they are added, so the layers we want to see on top must come after the background layers. Try setting the boxplot layer before the violin layer: can you still see the boxplot?
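A sketch of that experiment, with the two geom layers swapped so that the violin is drawn over the boxplot (the settings are simplified and the name p_rev is only for illustration):

```r
library(ggplot2)

# Same data and mapping, but the boxplot is added first
# and the violin is drawn on top of it
p_rev <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(color = "white", fill = "black", width = 0.2) +
  geom_violin(fill = "orange", alpha = 0.5)
p_rev
```

Compare the result with the previous plot, and try alpha = 1 in geom_violin() to see the effect more clearly.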
Other common geometries for a first glimpse at the data are: geom_histogram() and geom_density().
ggplot(iris, # data layer
mapping = aes(x = Sepal.Length)) + # x-axis
geom_histogram(mapping = # geom layer, histogram
aes (fill = Species)) # fill by Species
ggplot(iris, # data layer
mapping = aes(x = Sepal.Length)) + # x-axis
geom_density(mapping = # geom layer, probability function
aes (fill = Species), # fill by Species
alpha = 0.5) # half opacity
Another common way to summarize data is a bar plot of the mean ± SEM (standard error of the mean). For this, we add two layers: a geom_bar() layer and a geom_errorbar() layer. By default, geom_bar() displays the count of observations in each group. However, if we specify "summary" in the argument stat, the bars will show a transformation of the original data, in this case a summary statistic. Use the argument fun (called fun.y in ggplot2 versions before 3.3.0) to specify which statistic you want to display: the mean, the median… For the error bar we use the argument fun.data, because the error bar needs two values, the upper limit and the lower limit.
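To see which numbers the bars and error bars stand for, the mean and SEM can be computed by hand. This is a sketch with dplyr, assuming the usual definition SEM = sd / sqrt(n); the column names are chosen for illustration.

```r
library(dplyr)

# Mean and standard error of the mean per species
sepal_summary <- iris %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length),
            sem         = sd(Sepal.Length) / sqrt(n()),
            lower       = mean_length - sem,   # bottom of the error bar
            upper       = mean_length + sem)   # top of the error bar
sepal_summary
```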
ggplot(iris, # data layer
mapping = aes(x = Species, # x-axis
y = Sepal.Length, # y-axis
fill = Species)) + # fill by Species levels
geom_bar(stat = "summary", # bar for each Species
fun = "mean", # bar height is mean (fun.y in ggplot2 < 3.3.0)
width=0.3) + # bar width
geom_errorbar(stat = "summary", # error bar for each Species
fun.data = "mean_se", # error is SEM
color="black", # edge line in black
width=0.15) # error bar width
A common combination in plots is lines and points. We will try this by plotting the vapor pressure of mercury as a function of temperature.
ggplot(data = pressure, # data layer
mapping = aes(x = temperature, # x-axis
y = pressure)) + # y-axis
geom_line(linetype=2, # geom layer, line, specify line type
size = 0.7) + # line size
geom_point(color = "red", # geom layer, point, color of the points
shape = 20, # shape of the points
size = 4) # size of the points
We could add other types of lines:
geom_abline() draws a line defined by an intercept and a slope.
ggplot(data = pressure, # data layer
mapping = aes(x = temperature, # x-axis
y = pressure)) + # y-axis
geom_point() + # geom layer, point
geom_abline(intercept = 0, slope = 2.24, # geom layer, diagonal line
color = "red") # line color
geom_hline() draws a horizontal line at the yintercept position.
ggplot(data = pressure, # data layer
mapping = aes(x = temperature, # x-axis
y = pressure)) + # y-axis
geom_point() + # geom layer, point
geom_hline(yintercept = 100, # geom layer, horizontal line at y = 100
color = "red") # line color
geom_vline() draws a vertical line at the xintercept position.
ggplot(data = pressure, # data layer
mapping = aes(x = temperature, # x-axis
y = pressure)) + # y-axis
geom_point() + # geom layer, point
geom_vline(xintercept = 300, # geom layer, vertical line at x = 300
color = "red") # line color
We have the data set precip, which summarizes the monthly precipitation of 2017 and 2018 in Palau.
precip <- data.frame(Month = factor(month.abb, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")), # vector with month abbreviations, specify is a factor and the order of the levels
Precipitation_2017 = c(34, 43.6, 79.8, 29.4, 17.2, 13.6, 6.4, 18.0, 64.0, 65.8, 5.4, 6.4),
Precipitation_2018 = c(59.6, 77.4, 74.4, 80.4, 45.8, 47, 13, 32.8, 14.8, 120.6, 94.6, 8.4))
precip
We want to plot this information as a stacked bar plot, but the data are not in the long format that ggplot needs. Let’s reshape them the way ggplot likes.
Use the function pivot_longer() to reshape the data frame to the long format. Specify the data frame you want to change with the data argument. Specify the columns you want to convert to the long format with the argument cols; in this case, two columns must be converted, so we use a vector defined with c(column 1, column 2). Lastly, define the name of the new column that will hold the former column names with the names_to argument, and the name of the new numeric variable with the values_to argument.
precip_long <- pivot_longer(data = precip,
cols = c("Precipitation_2017", "Precipitation_2018"),
names_to = "Year",
values_to = "Precipitation")
precip_long
Now we are ready to plot the stacked bar graph.
For geom_bar(), the default behavior is to count the rows for each x value. By specifying stat = "identity" we are telling ggplot2 to skip the count and that we will provide the y values. By specifying position = "stack" we choose to display a stacked bar plot. There are other options for the position argument: “fill”, “dodge” or “jitter”. Try them!
ggplot(precip_long, # data layer
mapping = aes(x = Year, # x-axis
y = Precipitation, # y-axis
fill = Month)) + # filling factor
geom_bar(stat = "identity", # geom layer, bar
position = "stack") # stacked bar
There it is! Our stacked bar graph. But maybe you prefer to see the bars side by side to better identify the differences between months. For this, use position = "dodge".
ggplot(precip_long, # data layer
mapping = aes(x = Year, # x-axis
y = Precipitation, # y-axis
fill = Month)) + # filling factor
geom_bar(stat = "identity", # geom layer, bar
position = "dodge") # bars side by side
By default, ggplot() uses the Cartesian coordinate system. It makes no difference if you take coord_cartesian() out of the code.
# Subset precip_long data frame to get only data from year 2018
Precipitation_2018 <- precip_long[precip_long$Year=="Precipitation_2018", ]
# Plot accumulated precipitation from year 2018
ggplot(Precipitation_2018, # data layer
mapping = aes(x = Year, # x-axis
y = Precipitation, # y-axis
fill = Month)) + # filling factor
geom_bar(stat = "identity") + # geom layer, bar
coord_cartesian() # default coordinate system
We could flip the coordinates with coord_flip(), so that the variable mapped to x is used for the y-coordinates and the variable mapped to y is used for the x-coordinates.
# Subset precip_long data frame to get only data from year 2018
Precipitation_2018 <- precip_long[precip_long$Year=="Precipitation_2018", ]
ggplot(Precipitation_2018, # data layer
mapping = aes(x = Year, # x-axis
y = Precipitation, # y-axis
fill = Month)) + # filling factor
geom_bar(stat = "identity") + # geom layer, bar
coord_flip() # swaps the x and y axes
We could also use polar coordinates with coord_polar(). We represent each partition as an angle theta.
ggplot(Precipitation_2018, # data layer
mapping = aes(x = Year, # x-axis
y = Precipitation, # y-axis
fill = Month)) + # filling factor
geom_bar(stat = "identity") + # geom layer, bar
coord_polar(theta="y") # polar coordinates, angle defined by y
facet
In other sections we separated categorical data (factors) using color in the aesthetics mapping layer. In this section, we will see that it is possible to use facets to split one plot into multiple plots (panels) based on a factor included in the data set.
We will use the data set CO2.
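The chunk that produces the summary shown next is missing from this page; presumably it is a call along these lines (CO2 ships with base R in the datasets package, and the name co2_summary is only for illustration):

```r
# Summary statistics for every column of the built-in CO2 data set
co2_summary <- summary(datasets::CO2)
co2_summary
```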
## Plant Type Treatment conc uptake
## Qn1 : 7 Quebec :42 nonchilled:42 Min. : 95 Min. : 7.70
## Qn2 : 7 Mississippi:42 chilled :42 1st Qu.: 175 1st Qu.:17.90
## Qn3 : 7 Median : 350 Median :28.30
## Qc1 : 7 Mean : 435 Mean :27.21
## Qc3 : 7 3rd Qu.: 675 3rd Qu.:37.12
## Qc2 : 7 Max. :1000 Max. :45.50
## (Other):42
This data set records the CO2 uptake of plants originating from Quebec or Mississippi, measured at different CO2 concentrations.
First, create a basic boxplot to compare the CO2 uptake at different CO2 concentrations.
For this, we need R to interpret the conc variable as a factor with levels and not as a numeric variable.
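The current type of the column can be checked with class(). A small sketch, reading from the untouched copy in the datasets package (conc_class is an illustrative name):

```r
# Check how R currently stores the conc column
conc_class <- class(datasets::CO2$conc)
conc_class
```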
## [1] "numeric"
CO2 <- CO2 %>% # pipe CO2 data set in the next step
mutate_at(vars("conc"), as.factor) # convert conc variable to a factor
## [1] "factor"
## [1] "95" "175" "250" "350" "500" "675" "1000"
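The two outputs above can be reproduced with class() and levels(). This sketch works on a copy (the hypothetical co2_factor) and uses mutate(), an equivalent alternative to mutate_at() swapped in here for brevity:

```r
library(dplyr)

# Work on a copy so the original data set stays untouched
co2_factor <- as.data.frame(datasets::CO2) %>%
  mutate(conc = as.factor(conc))

class(co2_factor$conc)    # type of the converted column
levels(co2_factor$conc)   # one level per measured concentration
```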
Let’s plot:
ggplot(CO2, # data layer
mapping = aes(x = conc, # x-axis
y = uptake, # y-axis
fill = conc)) + # filling color by concentration factor
geom_boxplot() # geom layer, boxplot
As we can observe, the CO2 uptake increases with the CO2 concentration until it reaches a plateau at around 350 mL/L.
We could be interested in comparing the CO2 uptake measured at different CO2 concentrations, and finding out if there is an interaction with the plant origin.
The graph can be partitioned into multiple panels by the levels of the group Type by adding the layer facet_grid(). We can split in the vertical direction (factor~.) or in the horizontal direction (.~factor).
ggplot(CO2, # data layer
mapping = aes(x = conc, # x-axis
y = uptake, # y-axis
fill = conc)) + # filling color by concentration
geom_boxplot() + # geom layer, boxplot
facet_grid(Type~.) # Split in vertical direction by Type factor
ggplot(CO2, # data layer
mapping = aes(x = conc, # x-axis
y = uptake, # y-axis
fill = conc)) + # filling color by concentration
geom_boxplot() + # geom layer, boxplot
facet_grid(.~Type) # Split in horizontal direction by Type factor
Now that we have split the data set by origin, we can observe two different dynamics. The CO2 uptake of plants from Quebec is always higher than that of plants from Mississippi, independently of the CO2 concentration. However, plants from Mississippi reach a plateau at a concentration of 350 mL/L, whereas plants from Quebec keep increasing their uptake as the concentration rises.
We can partition the graph by levels of the groups Type and conc. We must specify within facet_grid() the matrix display we want as follows: (factor 1 ~ factor 2). Factor 1 defines the rows and factor 2 defines the columns.
# Facet by two variables: Type and conc
ggplot(CO2, # data layer
mapping = aes(x = conc, # x-axis
y = uptake, # y-axis
fill = conc)) + # filling color by concentration
geom_boxplot() + # geom layer, boxplot
facet_grid(conc~Type) # Rows are conc and columns are Type
We could reverse the order of the two variables, but see what happens on the x-axis.
# Facet by two variables: reverse the order of the 2 variables
ggplot(CO2, # data layer
mapping = aes(x = conc, # x-axis
y = uptake, # y-axis
fill = conc)) + # filling color by concentration
geom_boxplot() + # geom layer, boxplot
facet_grid(Type~conc) # Rows are Type and columns are conc
To fix the overlapping labels on the x-axis, we can change their angle within the theme layer.
# Fix the overlapping between labels in the x-axis
ggplot(CO2, # data layer
mapping = aes(x = conc, # x-axis
y = uptake, # y-axis
fill = conc)) + # filling color by concentration
geom_boxplot() + # geom layer, boxplot
facet_grid(Type~conc) + # Rows are Type and columns are conc
# change angle of the text in x axis
theme(axis.text.x = element_text(angle = 90))
One of the coolest features of ggplot is the multiple ways to customize the look of plots.
There are many color palettes available, e.g.:
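The picture of the palettes did not survive on this page. One way to list them yourself is sketched here, assuming the RColorBrewer package is installed (it normally comes along with ggplot2 as a dependency):

```r
library(RColorBrewer)  # assumption: installed together with ggplot2

# Draw every ColorBrewer palette next to its name
display.brewer.all()

# Table with the name, size and type of each palette
head(brewer.pal.info)
```

The palette names shown there are the ones accepted by the palette argument of the brewer and distiller scale functions.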
We only need to add a scale layer to our ggplot and choose the palette we like.
It is important to choose the right scale function depending on our type of data (discrete or continuous) and on the aesthetics we have defined (fill, color, size …).
Let’s see some examples:
ggplot(data = iris, # data layer
mapping = aes(x = Petal.Length, # axes layer
fill = Species)) + # filling by the Species factor
geom_histogram () + # geom layer
scale_fill_brewer(palette = "Set3") # define palette for filling
ggplot(data = iris, # data layer
mapping = aes(x = Petal.Length, # axes layer
fill = Species)) + # filling by the Species factor
geom_dotplot () + # geom layer
scale_fill_grey() # grey palette
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Species)) + # color by the Species factor
geom_point(size = 2) + # geom layer
scale_color_brewer(palette = "Dark2") # define palette for color
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Sepal.Length)) + # color by Sepal.Length
geom_point(size = 2) + # geom layer
scale_color_distiller(palette = "YlGnBu") # define palette for color
There are many predefined themes to change the appearance of your plots:
theme_classic()
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Sepal.Length)) + # color by Sepal.Length
geom_point(size = 2) + # geom layer
scale_color_distiller(palette = "YlGnBu") + # define palette for color
theme_classic() # specify theme
theme_dark()
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Sepal.Length)) + # color by Sepal.Length
geom_point(size = 2) + # geom layer
scale_color_distiller(palette = "YlGnBu") + # define palette for color
theme_dark() # specify theme
theme_minimal()
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Sepal.Length)) + # color by Sepal.Length
geom_point(size = 2) + # geom layer
scale_color_distiller(palette = "YlGnBu") + # define palette for color
theme_minimal() # specify theme
You can save your plots in several formats (.png, .jpg, .pdf) with the function ggsave().
ggsave() saves the last plot displayed, with the given width and height (in inches by default), to the file named in its first argument, inside your working directory unless you give a full path. It matches the file type to the file extension.
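Instead of relying on the last plot displayed, a plot stored in a variable can be passed through the plot argument. A minimal sketch; the file name and the temporary folder are hypothetical choices for this illustration:

```r
library(ggplot2)

# Store the plot in a variable instead of only displaying it
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# Hypothetical output path in a temporary folder, used only for this sketch
out_file <- file.path(tempdir(), "sepal_vs_petal.png")

ggsave(filename = out_file,  # the extension decides the file type
       plot = p,             # save this plot, not the last one displayed
       width = 5, height = 5, units = "in", dpi = 300)

file.exists(out_file)        # check that the file was written
```

Naming the plot explicitly keeps the saved file independent of whatever was drawn last in the session.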
ggplot(data = iris, # data layer
mapping = aes(x = Sepal.Length, # axes layer
y = Petal.Length, # axes layer
color = Sepal.Length)) + # color by Sepal.Length
geom_point(size = 2) + # geom layer
scale_color_distiller(palette = "YlGnBu") # define palette for color
ggsave(filename = "petal_lengthvssepal_length.png", # save file in your wd
width = 5,
height =5)
ggsave(filename = "E:/USB DOCTORADO/Clases IAMZ2020/Visualization/petal_lengthvssepal_length.png", # save to another directory
width = 5,
height =5)
Now you know how powerful this tool can be. However, it takes time to get used to the grammar and a lot of practice to obtain the perfect plot you have in mind. Do not hesitate and start trying!
Print this ggplot2 cheat sheet and check it any time you need it.
Plotly: Interactive and more eye-catching graphics
As always with R, new packages keep appearing that improve on features of earlier ones. The package plotly allows building interactive plots.
Let’s see several examples:
library(plotly) # load plotly (install it once with install.packages("plotly"))
plot_ly(data = iris, # data set
x = ~Sepal.Length, # x-axis
y = ~Petal.Length, # y-axis
color = ~Species) # color by Species